Abstract

Recommendation systems are emerging as an important business application with significant economic
impact. Currently popular systems include Amazon's book recommendations, Netflix's movie
recommendations, and Pandora's music recommendations. In this paper we address the problem of estimating
probabilities associated with recommendation system data using non-parametric kernel smoothing. In
our estimation we interpret missing items as randomly censored observations and obtain efficient
computation schemes using combinatorial properties of generating functions. We demonstrate our approach
with several case studies involving real world movie recommendation data. The results are comparable
with state-of-the-art techniques while also providing probabilistic preference estimates outside the scope
of traditional recommender systems.



1 Introduction 

Recommendation systems are emerging as an important business application with significant economic
impact. The data in such systems are collections of incomplete tied preferences across $n$ items associated
with $m$ different users. Given an incomplete tied preference associated with an additional user $m+1$, the
system recommends unobserved items to that user based on the preference relations of the $m+1$ users.
Currently deployed recommendation systems include book recommendations at amazon.com, movie
recommendations at netflix.com, and music recommendations at pandora.com. Constructing accurate
recommendation systems (that recommend to users items that are truly preferred over other items) is
important for assisting users as well as increasing business profitability. It is an important unsolved goal in
machine learning and data mining.

In most cases of practical interest the number of items $n$ indexed by the system (items may be books,
movies, songs, etc.) is relatively high, in the $10^3$ to $10^4$ range. Perhaps due to the size of $n$, it is
almost always the case that each user observes only a small subset of the items, typically in the range 10-100.
As a result the preference relations expressed by the users are over a small subset of the $n$ items.

Formally, we have m users providing incomplete tied preference relations on n items 

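\[
\begin{aligned}
S_1 &: A_{1,1} \prec A_{1,2} \prec \cdots \prec A_{1,k(1)} \\
S_2 &: A_{2,1} \prec A_{2,2} \prec \cdots \prec A_{2,k(2)} \\
    &\;\;\vdots \\
S_m &: A_{m,1} \prec A_{m,2} \prec \cdots \prec A_{m,k(m)}
\end{aligned}
\tag{1}
\]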

where $A_{i,j} \subset \{1,\ldots,n\}$ are sets of items (wlog we identify items with integers $1,\ldots,n$)
defined by the following interpretation: user $i$ prefers all items in $A_{i,j}$ to all items in $A_{i,j+1}$.
The notation $k(i)$ above is the number of such sets provided by user $i$. The data (1) is incomplete since
not all items are necessarily observed by each user, i.e., $\bigcup_j A_{i,j} \subsetneq \{1,\ldots,n\}$, and
may contain ties since some items are left uncompared, i.e., $|A_{i,j}| > 1$. Recommendation systems
recommend items to a new user, denoted as $m+1$, based on their preference

\[
S_{m+1} : A_{m+1,1} \prec A_{m+1,2} \prec \cdots \prec A_{m+1,k(m+1)} \tag{2}
\]

and its relation to the preferences of the $m$ users (1).

As an illustrative example, assuming n = 9, m = 3, the data 

\[
\begin{aligned}
S_1 &: 1,8,9 \prec 4 \prec 2,3,7 \\
S_2 &: 4 \prec 2,3 \prec 8 \\
S_3 &: 4,8 \prec 2,6,9
\end{aligned}
\]

corresponds to $A_{1,1} = \{1,8,9\}$, $A_{1,2} = \{4\}$, $A_{1,3} = \{2,3,7\}$, $A_{2,1} = \{4\}$,
$A_{2,2} = \{2,3\}$, $A_{2,3} = \{8\}$, $A_{3,1} = \{4,8\}$, $A_{3,2} = \{2,6,9\}$, and $k(1) = k(2) = 3$,
$k(3) = 2$. From the data we may guess that item 4 is relatively popular across the board while some users
like item 8 (users 1, 3) and some hate it (user 2). Given a new user $m+1$ issuing the preference
$1 \prec 2,3,7$ we might observe a similar pattern of preference or taste as user 1 and recommend to the user
item 8. We may also recommend item 4, which has broad appeal, resulting in the augmentation

\[
1 \prec 2,3,7 \;\mapsto\; 1,4,8 \prec 2,3,7.
\]

We note that in some cases the preference relations (1) arise from users providing numeric scores to items.
For example, if the users assign 1-5 stars to movies, the set $A_{i,j}$ contains all movies that user $i$
assigned $6 - j$ stars to and $k(i) = 5$ (assuming some movies were assigned to each of the 1, 2, 3, 4, 5 star
levels). As pointed out by a wide variety of studies in economics and social sciences, such numeric scores are
inconsistent among different users. We therefore proceed to interpret such data as ordinal rather than numeric.
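
As a minimal sketch of the conversion just described, the following groups one user's star ratings into the ordered sets $A_{i,1} \prec A_{i,2} \prec \cdots$. The function name `ratings_to_tied_ranking` and the dictionary input format are assumptions made for illustration; they do not come from the paper. Unlike the text, which assumes every star level is used, this sketch simply drops empty levels.

```python
from collections import defaultdict

def ratings_to_tied_ranking(ratings, max_stars=5):
    """Group items by star level, most preferred (highest stars) first.

    `ratings` maps item id -> number of stars (hypothetical input format).
    Returns the non-empty item sets ordered from most to least preferred.
    """
    by_stars = defaultdict(set)
    for item, stars in ratings.items():
        by_stars[stars].add(item)
    # Walk from max_stars down to 1 and keep only the levels that were used.
    return [by_stars[s] for s in range(max_stars, 0, -1) if s in by_stars]

# Items 10 and 12 are tied at the top, then item 11, then item 13.
print(ratings_to_tied_ranking({10: 5, 11: 3, 12: 5, 13: 1}))
```

Only the order of the groups is kept; the numeric gaps between star levels are discarded, which is the ordinal reading adopted above.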

A substantial body of literature in computer science has addressed the problem of constructing recom- 
mendation systems. We have attempted to outline the most important and successful approaches in the 
related work section towards the end of this paper. However, none of these previous approaches are fully
satisfactory from a statistical perspective: there are no reasonable probability models assumed to generate 
the data and no clear meaningful statistical estimation procedures. We substantiate this argument more 
fully in the related work section. 

In this paper we describe a non-parametric statistical technique for estimating probabilities on preferences 
based on the data (1). This technique may be used in recommendation systems in different ways. Its principal
usage may be to provide a statistically meaningful estimation framework for issuing recommendations (in 
conjunction with decision theory). However, it also leads to other important applications including mining 
association rules, exploratory data analysis, and clustering items and users. Two key observations that we 
make are: (i) incomplete tied preference data may be interpreted as randomly censored permutation data, 
and (ii) using generating functions we are able to provide a computationally efficient scheme for computing 
the estimator in the case of triangular smoothing. 

We proceed in the next sections to describe notations and our assumptions and estimation procedure, 
and follow with case studies demonstrating our approach on real world recommendation systems data. 

2 Definitions and Estimation Framework 

We describe the following notations and conventions for permutations, which are taken from [?] where more
detail may be found. We denote a permutation by listing the items from most preferred to least, separated by
a $\prec$ or $|$ symbol: $\pi^{-1}(1) \prec \pi^{-1}(2) \prec \cdots \prec \pi^{-1}(n)$, e.g. $\pi(1) = 2$,
$\pi(2) = 3$, $\pi(3) = 1$ is $3 \prec 1 \prec 2$. Rankings with ties occur when judges do not provide enough
information to construct a total order. In particular, we define tied rankings as a partition of
$\{1,\ldots,n\}$ into $k < n$ disjoint subsets $A_1,\ldots,A_k \subset \{1,\ldots,n\}$ such that all items in
$A_i$ are preferred to all items in $A_{i+1}$ but no information is provided concerning the relative preference
of the items within each set $A_i$. We denote such rankings by separating the items in $A_i$ and $A_{i+1}$
with a $\prec$ or $|$ notation. For example, the tied ranking $A_1 = \{3\}$, $A_2 = \{2\}$, $A_3 = \{1,4\}$
(items 1 and 4 are tied for last place) is denoted as $3 \prec 2 \prec 1,4$ or $3|2|1,4$.

Rankings with missing items occur when judges omit certain items from their preference information
altogether. For example, assuming a set of items $\{1,\ldots,4\}$, a judge may report a preference
$3 \prec 2 \prec 4$, omitting altogether item 1 which the judge did not observe or experience. This case is
very common in situations involving a large number of items $n$. In this case judges typically provide
preferences only for the $l \ll n$ items that they observed or experienced. For example, in movie
recommendation systems we may have $n \sim 10^3$ and $l \sim 10^1$.

Rankings can be full (permutations), with ties, with missing items, or with both ties and missing items.
In all cases we denote the rankings using the $\prec$ or $|$ notation or using the disjoint sets
$A_1,\ldots,A_k$ notation. We also represent tied and incomplete rankings by the set of permutations that are
consistent with them. For example,

\[
\begin{aligned}
3 \prec 2 \prec 1,4 &= \{3 \prec 2 \prec 1 \prec 4\} \cup \{3 \prec 2 \prec 4 \prec 1\} \\
3 \prec 2 \prec 4 &= \{1 \prec 3 \prec 2 \prec 4\} \cup \{3 \prec 1 \prec 2 \prec 4\}
  \cup \{3 \prec 2 \prec 1 \prec 4\} \cup \{3 \prec 2 \prec 4 \prec 1\}
\end{aligned}
\]

are sets of two and four permutations corresponding to tied and incomplete rankings, respectively.
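
The set representation above can be reproduced by brute force for small $n$. The sketch below is illustrative only: the function name `consistent_permutations` and the list-of-sets input are assumptions, and the full enumeration of $n!$ orderings is practical only for tiny $n$.

```python
from itertools import permutations

def consistent_permutations(tiers, n):
    """Return all orderings of items 1..n (most preferred first) that respect `tiers`.

    `tiers` is a list of disjoint item sets [A_1, ..., A_k]; every item in A_j
    must precede every item in A_{j+1}. Items in no tier are unconstrained.
    """
    tier_of = {item: j for j, tier in enumerate(tiers) for item in tier}
    result = []
    for order in permutations(range(1, n + 1)):
        # Tier indices of the ranked items, in the order they appear.
        ranked = [tier_of[x] for x in order if x in tier_of]
        if all(a <= b for a, b in zip(ranked, ranked[1:])):
            result.append(order)
    return result

print(consistent_permutations([{3}, {2}, {1, 4}], 4))   # tied ranking 3 < 2 < 1,4
print(consistent_permutations([{3}, {2}, {4}], 4))      # incomplete ranking 3 < 2 < 4
```

The first call returns the two permutations and the second the four permutations listed in the example.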

It is hard to directly posit a coherent probabilistic model on incomplete tied data such as (1). Different
preference relations are not unrelated to each other: they may subsume one another (for example
$1 \prec 2 \prec 3$ and $1 \prec 3$), represent disjoint events (for example $1 \prec 3$ and $3 \prec 1$), or
interact in more complex ways (for example $1 \prec 2 \prec 3$ and $1 \prec 4 \prec 3$). A valid probabilistic
framework needs to respect the constraints resulting from the axioms of probability, e.g.,
$p(1 \prec 2 \prec 3) \le p(1 \prec 3)$.

Our approach is to consider the incomplete tied preferences as censored permutations. That is, we assume
a distribution $p(\pi)$ over permutations $\pi \in \mathfrak{S}_n$ ($\mathfrak{S}_n$ is the symmetric group of
permutations of order $n$) that describes the complete without-ties preferences in the population. The data
available to the recommender system (1) is sampled by drawing $m$ iid permutations from $p$:
$\pi_1,\ldots,\pi_m \sim p$, followed by censoring to result in the observed preferences $S_1,\ldots,S_m$

\[
\pi_i \sim p(\pi), \qquad S_i \sim p(S \mid \pi_i), \qquad i = 1,\ldots,m+1 \tag{3}
\]

\[
p(\pi \mid S) = \frac{I(\pi \in S)\,p(\pi)}{\sum_{\sigma \in S} p(\sigma)} \tag{4}
\]

\[
p(S \mid \pi) = \frac{p(\pi \mid S)\,p(S)}{p(\pi)}
 = \frac{I(\pi \in S)\,p(\pi)\,p(S)}{p(\pi)\sum_{\sigma \in S} p(\sigma)}
 = \frac{I(\pi \in S)\,p(S)}{\sum_{\sigma \in S} p(\sigma)} \tag{5}
\]

where $p(S)$ is the probability of observing the censoring $S$ (specifically, it is not equal to
$\sum_{\sigma \in S} p(\sigma)$).

Although many approaches for estimating $p$ given $S_1,\ldots,S_m$ are possible, experimental evidence points
to the fact that in recommendation systems with high $n$, the distribution $p$ does not follow a simple
parametric form such as the Mallows, Bradley-Terry, or Thurstone models [?] (see Figure 1 for a demonstration
of how parametric assumptions break down with increasing $n$). Instead, the distribution $p$ tends to be
diffuse and multimodal with different probability mass regions corresponding to different types of judges (for
example, in movie preferences probability modes may correspond to genre, as fans of drama, action, comedy,
etc. have similar preferences).

We therefore propose to estimate the underlying distribution p on permutations using non-parametric 
kernel smoothing. The standard kernel smoothing formula applies to the permutation setting as 

\[
\hat{p}(\pi) = \frac{1}{m} \sum_{i=1}^{m} K_h(T(\pi, \pi_i))
\]

where $\pi_1,\ldots,\pi_m \sim p$, $T$ is a distance on permutations such as Kendall's distance, and
$K_h(r) = h^{-1} K(r/h)$ is a normalized unimodal function. In the case at hand, however, the observed
preferences $\pi_i$ as well as $\pi$ are replaced with permutation sets $S_1,\ldots,S_m, R$ representing
incomplete tied preferences

\[
\hat{p}(R) = \sum_{\pi \in R} \hat{p}(\pi)
 = \frac{1}{m} \sum_{i=1}^{m} \sum_{\pi \in R} \sum_{\sigma \in S_i} q(\sigma \mid S_i)\, K_h(T(\pi, \sigma))
\tag{6}
\]

Figure 1: Heat map visualization of the density of ranked data using multidimensional scaling with expected
Kendall's Tau distance. The datasets are APA voting (left, n = 5), Jester (middle, n = 100), and EachMovie
(right, n = 1628). None of these cases show a simple parametric form, and the complexity of the
density increases with the number of items n. This motivates the use of non-parametric estimators for
modeling preferences over a large number of items.



where $q(\sigma \mid S_i)$ serves as a surrogate for the unknown
$p(\sigma \mid S_i) \propto I(\sigma \in S_i)\,p(\sigma)$ (see (4)). Selecting
$q(\sigma \mid S_i) = p(\sigma \mid S_i)$ would lead to consistent estimation of $p(R)$ in the limit
$h \to 0$, $m \to \infty$ assuming positive $p(\pi)$, $p(S)$. Such a selection, however, is generally
impossible since $p(\pi)$ and therefore $p(\sigma \mid S_i)$ are unknown.

In general the specific choice of the surrogate $q(\sigma \mid S)$ is important as it may influence the
estimated probabilities. Furthermore, it may cause underestimation or overestimation of $\hat{p}(R)$ in the
limit of large data. An exception occurs when the sets $S_1,\ldots,S_m$ are either subsets of $R$ or disjoint
from $R$. In this case $\lim_{h \to 0} K_h(\pi, \sigma) = I(\pi = \sigma)$, resulting in the following limit
(with probability 1 by the strong law of large numbers)

\[
\lim_{m \to \infty} \lim_{h \to 0} \hat{p}(R)
 = \lim_{m \to \infty} \frac{1}{m} \sum_{i=1}^{m} I(S_i \subset R) \sum_{\sigma \in S_i} q(\sigma \mid S_i)
\tag{7}
\]

\[
 = \lim_{m \to \infty} \frac{1}{m} \sum_{i=1}^{m} I(S_i \subset R)
 = \lim_{m \to \infty} \frac{1}{m} \sum_{i=1}^{m} I(\pi_i \in R) = p(R).
\tag{8}
\]


Thus, if our data is comprised of preferences $S_i$ that are either disjoint from or a subset of $R$ we have
consistency regardless of the choice of the surrogate $q$. Such a situation is more realistic when the
preference $R$ involves a small number of items and the preferences $S_i$, $i = 1,\ldots,m$ involve a larger
number of items. This is often the case for recommendation systems where individuals report preferences over
10-100 items and we are mostly interested in estimating probabilities of preferences over fewer items such as
$i \prec j,k$ or $i \prec j,k \prec l$ (see experiment section).

Figure 2: Tricube, triangular, and uniform kernels on $\mathbb{R}$ with bandwidth h = 1 (left) and h = 2
(middle). Right: triangular kernel on permutations (n = 3).

The main difficulty with the estimator above is the computation of
$\sum_{\pi \in R} \sum_{\sigma \in S_i} q(\sigma \mid S_i)\, K_h(T(\pi, \sigma))$. In the case of high $n$ and
only a few observed items $k$ the sets $S_i, R$ grow factorially as $(n-k)!$, making a naive computation of
(6) intractable for all but the smallest $n$. In the next section we explore efficient computations of these
sums for a triangular kernel $K_h$ and a uniform $q(\pi \mid S)$.
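
To make the cost of the naive computation concrete, here is a brute-force sketch of estimator (6) with a uniform surrogate $q$ and a triangular kernel of the form $\max(0, 1 - T/h)$ divided by its sum over all permutations. It is an illustration under assumptions, not the authors' implementation: all function names are hypothetical, rankings are passed as lists of item sets, and every quantity is obtained by enumerating all $n!$ orderings, so it only runs for very small $n$.

```python
from itertools import permutations

def kendall_tau(a, b):
    """Number of item pairs that orderings a and b rank in opposite order."""
    pos = {item: r for r, item in enumerate(b)}
    seq = [pos[x] for x in a]
    return sum(seq[i] > seq[j] for i in range(len(seq)) for j in range(i + 1, len(seq)))

def consistent(tiers, n):
    """All orderings of items 1..n (best first) compatible with a tied, incomplete ranking."""
    tier_of = {x: t for t, tier in enumerate(tiers) for x in tier}
    keep = []
    for order in permutations(range(1, n + 1)):
        seen = [tier_of[x] for x in order if x in tier_of]
        if seen == sorted(seen):
            keep.append(order)
    return keep

def naive_estimate(R, data, n, h):
    """Brute-force estimate of p(R): uniform q over each S_i, triangular kernel, bandwidth h."""
    everything = list(permutations(range(1, n + 1)))
    tri = lambda d: max(0.0, 1.0 - d / h)
    # Kendall's tau is invariant under relabeling of items, so the kernel
    # normalizer is the same for every center; compute it around one ordering.
    C = sum(tri(kendall_tau(everything[0], s)) for s in everything)
    R_set = consistent(R, n)
    total = 0.0
    for S in data:
        S_set = consistent(S, n)
        q = 1.0 / len(S_set)
        total += sum(q * tri(kendall_tau(p, s)) / C for p in R_set for s in S_set)
    return total / len(data)

data = [[{1}, {2, 3}], [{1, 4}, {2}]]        # toy preferences: 1 < 2,3 and 1,4 < 2
p12 = naive_estimate([{1}, {2}], data, n=4, h=3)
p21 = naive_estimate([{2}, {1}], data, n=4, h=3)
print(p12, p21, p12 + p21)
```

Because the events $1 \prec 2$ and $2 \prec 1$ partition all permutations, the two estimates add up to one. The double loop over `R_set` and `S_set` is exactly the factorial-size sum that the next section avoids.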

3 Computationally Efficient Kernel Smoothing 

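In previous work [10] the estimator (6) is proposed for tied (but complete) rankings. That work derives
closed form expressions and efficient computation for (6) assuming a Mallows kernel [?]

\[
K_h(T(\pi, \sigma)) = \exp\!\left(-\frac{T(\pi, \sigma)}{h}\right) \prod_{j=1}^{n} \frac{1 - e^{-1/h}}{1 - e^{-j/h}}
\tag{9}
\]

where $T$ is Kendall's Tau distance on permutations (below $I(x) = 1$ for $x > 0$ and $0$ otherwise)

\[
T(\pi, \sigma) = \sum_{i=1}^{n-1} \sum_{l > i} I\big(\pi\sigma^{-1}(i) - \pi\sigma^{-1}(l)\big).
\tag{10}
\]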

Unfortunately these simplifications do not carry over to the case of incomplete rankings where the sets
of consistent permutations $S_1,\ldots,S_m$ are not cosets of the symmetric group. As a result the problem of
probability estimation in recommendation systems where $n$ is high and many items are missing is particularly
challenging. However, as we show below, replacing the Mallows kernel (9) with a triangular kernel leads to
efficient computation in some cases. Specifically, the triangular kernel on permutations is

\[
K_h(T(\pi, \sigma)) = \big(1 - h^{-1} T(\pi, \sigma)\big)\, I\big(h - T(\pi, \sigma)\big) / C
\tag{11}
\]

where the bandwidth parameter $h$ represents both the support (the kernel is 0 for all larger distances) and
the inverse slope of the triangle. As we show below, the normalization term $C$ is a function of $h$ and may be
efficiently computed using generating functions. Figure 2 (right panel) displays the linear decay of (11) for
the simple case of permutations over $n = 3$ items.

Combinatorial Generating Function 

Generating functions, a tool from enumerative combinatorics, allow efficient computation of (6) by concisely
expressing the distribution of distances between permutations. Kendall's tau $T(\pi, \sigma)$ is the total
number of discordant pairs or inversions between $\pi, \sigma$ [20] and thus its computation becomes a
combinatorial counting problem. We associate the following generating function with the symmetric group of
order $n$ permutations

\[
G_n(z) = \prod_{j=1}^{n-1} \sum_{k=0}^{j} z^k.
\tag{12}
\]






As shown for example in [20] the coefficient of $z^k$ of $G_n(z)$, which we denote as $[z^k]G_n(z)$,
corresponds to the number of permutations $\sigma$ for which $T(\sigma, \pi') = k$, where $\pi'$ is an
arbitrary fixed permutation. For example, the distribution of Kendall's tau $T(\cdot, \pi')$ over all
permutations of 3 items is described by $G_3(z) = (1 + z)(1 + z + z^2) = 1z^0 + 2z^1 + 2z^2 + 1z^3$, i.e.,
there is one permutation $\sigma$ with $T(\sigma, \pi') = 0$, two permutations $\sigma$ with
$T(\sigma, \pi') = 1$, two with $T(\sigma, \pi') = 2$ and one with $T(\sigma, \pi') = 3$. Another important
generating function is

\[
H_n(z) = \frac{G_n(z)}{1 - z} = (1 + z + z^2 + z^3 + \cdots)\, G_n(z)
\]

where $[z^k]H_n(z)$ represents the number of permutations $\sigma$ for which $T(\sigma, \pi') \le k$.
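
The coefficient interpretation can be checked numerically. The sketch below expands $G_n(z)$ by direct polynomial multiplication, takes running sums to get the coefficients of $H_n(z)$ up to the degree of $G_n$, and compares against a brute-force inversion count. The function names are illustrative assumptions, and the expansion shown is the plain textbook product rather than any particular routine from the paper.

```python
from itertools import permutations
from collections import Counter

def g_coefficients(n):
    """Coefficients of G_n(z) = prod_{j=1}^{n-1} (1 + z + ... + z^j), lowest degree first."""
    coeffs = [1]
    for j in range(1, n):
        new = [0] * (len(coeffs) + j)
        for d, c in enumerate(coeffs):
            for k in range(j + 1):          # multiply by 1 + z + ... + z^j
                new[d + k] += c
        coeffs = new
    return coeffs

def h_coefficients(n):
    """Coefficients of H_n(z) = G_n(z)/(1 - z) up to the degree of G_n (running sums)."""
    out, running = [], 0
    for c in g_coefficients(n):
        running += c
        out.append(running)
    return out

def inversions(order):
    """Kendall's tau distance between `order` and the sorted (identity) ordering."""
    return sum(order[i] > order[j] for i in range(len(order)) for j in range(i + 1, len(order)))

print(g_coefficients(3))    # coefficients of G_3(z)
print(h_coefficients(3))    # cumulative counts: permutations within distance k
counts = Counter(inversions(p) for p in permutations(range(4)))
print([counts[k] for k in range(7)] == g_coefficients(4))
```

For $n = 3$ the first line reproduces the coefficients $1, 2, 2, 1$ worked out in the text.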

Proposition 1. The normalization term $C(h)$ is given by
$C(h) = [z^h]H_n(z) - h^{-1}\,[z^{h-1}]\,\frac{G_n'(z)}{1 - z}$.

Proof. The proof splits the non-normalized triangular kernel $C K_h(\pi, \sigma)$ into the two terms
$I(h - T(\pi, \sigma))$ and $h^{-1} T(\pi, \sigma)\, I(h - T(\pi, \sigma))$ and makes the following
observations. First we note that the sum of the first term over all permutations is given by $[z^h]H_n(z)$.
The second observation is that $[z^{k-1}]G_n'(z)$ is the number of permutations $\sigma$ for which
$T(\sigma, \pi') = k$, multiplied by $k$. Since we want to sum that quantity over all permutations whose
distance is at most $h$ (the terms at distance exactly $h$ cancel between the two sums) we extract the
$h - 1$ coefficient of the generating function $G_n'(z) \sum_{k \ge 0} z^k = G_n'(z)/(1 - z)$. We thus have

\[
C = \sum_{\sigma : T(\pi', \sigma) \le h} 1 \;-\; h^{-1} \sum_{\sigma : T(\pi', \sigma) \le h} T(\pi', \sigma)
  = [z^h]H_n(z) - h^{-1}\,[z^{h-1}]\,\frac{G_n'(z)}{1 - z}.
\]

□

Proposition 2. The complexity of computing $C(h)$ is $O(n^4)$.

Proof. We describe a dynamic programming algorithm to compute the coefficients of $G_n$ by recursively
computing the coefficients of $G_k$ from the coefficients of $G_{k-1}$, $k = 1,\ldots,n$. The generating
function $G_k(z)$ has $k(k+1)/2$ non-zero coefficients and computing each of them (using the coefficients of
$G_{k-1}$) takes $O(k)$. We thus have $O(k^3)$ to compute $G_k$ from $G_{k-1}$, which implies $O(n^4)$ to
compute $G_k$, $k = 1,\ldots,n$. We conclude the proof by noting that once the coefficients of $G_n$ are
computed the coefficients of $H_n(z)$ and $G_n'(z)/(1 - z)$ are computable in $O(n^2)$ as these are simply
cumulative weighted sums of the coefficients of $G_n$. □
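
The two propositions can be exercised together on a small case. The sketch below builds the coefficients of $G_n$ one factor at a time, then evaluates the closed form of Proposition 1 and compares it with a direct sum of the unnormalized triangular kernel over all permutations. Names such as `normalizer` are hypothetical. One deliberate deviation is labeled in the comments: each new coefficient is obtained from running sums in constant time, which is an implementation choice of this example and not the $O(k)$ per-coefficient accounting used in the proof.

```python
from itertools import permutations

def g_coefficients(n):
    """Coefficients of G_n(z), built factor by factor (lowest degree first)."""
    coeffs = [1]
    for j in range(1, n):
        prefix = [0]
        for c in coeffs:
            prefix.append(prefix[-1] + c)
        last = len(coeffs) - 1
        # new[d] = coeffs[d-j] + ... + coeffs[d], clipped to valid indices;
        # running sums make each entry O(1) (a choice of this sketch).
        coeffs = [prefix[min(d, last) + 1] - prefix[max(d - j, 0)]
                  for d in range(len(coeffs) + j)]
    return coeffs

def normalizer(n, h):
    """C(h) = [z^h] H_n(z) - (1/h) [z^(h-1)] G_n'(z)/(1 - z), for integer h >= 1."""
    g = g_coefficients(n)
    top = min(h, len(g) - 1)
    count = sum(g[: top + 1])                          # [z^h] H_n(z)
    weighted = sum(k * g[k] for k in range(top + 1))   # [z^(h-1)] G_n'(z)/(1 - z)
    return count - weighted / h

def brute_force_normalizer(n, h):
    """Sum of max(0, 1 - T/h) over all permutations, T measured from the identity."""
    total = 0.0
    for p in permutations(range(n)):
        t = sum(p[i] > p[j] for i in range(n) for j in range(i + 1, n))
        total += max(0.0, 1.0 - t / h)
    return total

for h in (1, 2, 3, 5):
    print(h, normalizer(4, h), brute_force_normalizer(4, h))
```

The table of $C(h)$ values depends only on $n$ and $h$, so in the setting of the text it could be filled once before any rankings arrive.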

Note that computing $C(h)$ for one or many $h$ values may be done offline prior to the arrival of the
rankings and the need to compute the estimated probabilities.

Denoting by $k$ the number of items ranked in either $S$ or $R$ or both, the computation of $\hat{p}(R)$ in
(6) requires $O(k^2)$ online and $O(n^4)$ offline complexity if either non-zero smoothing is performed over
the entire data, i.e., $\max_{\pi \in R} \max_{i=1}^{m} \max_{\sigma \in S_i} T(\sigma, \pi) < h$, or
alternatively, we use the modified triangular kernel $K_h^*(\pi, \sigma) \propto 1 - h^{-1} T(\pi, \sigma)$
which is allowed to take negative values for the most distant permutations (normalization still applies
though).

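Proposition 3. For two sets of permutations $S, R$ corresponding to tied-incomplete rankings

\[
\frac{1}{|S||R|} \sum_{\pi \in S} \sum_{\sigma \in R} T(\pi, \sigma)
 = \frac{n(n-1)}{4} - \frac{1}{2} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \big(1 - 2p_{ij}(S)\big)\big(1 - 2p_{ij}(R)\big)
\tag{13}
\]

\[
p_{ij}(U) =
\begin{cases}
I\big(\tau_U(j) - \tau_U(i)\big) & \text{$i$ and $j$ are ranked in $U$ with $\tau_U(i) \ne \tau_U(j)$} \\[4pt]
1 - \dfrac{\tau_U(i) + \frac{\phi_U(i) - 1}{2}}{k + 1} & \text{only $i$ is ranked in $U$} \\[8pt]
\dfrac{\tau_U(j) + \frac{\phi_U(j) - 1}{2}}{k + 1} & \text{only $j$ is ranked in $U$} \\[8pt]
1/2 & \text{otherwise}
\end{cases}
\]

with $\tau_U(i) = \min_{\pi \in U} \pi(i)$, and $\phi_U(i)$ being the number of items that are tied to $i$ in $U$.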



Proof. We note that (13) is an expectation with respect to the uniform measure. We thus start by computing
the probability $p_{ij}(U)$ that $i$ is preferred to $j$ for $U = S$ and $U = R$ under the uniform measure.
Five scenarios exist for each of $p_{ij}(U)$ corresponding to whether each of $i$ and $j$ are ranked by
$S$, $R$. Starting with the case that $i$ is not ranked and $j$ is ranked, we note that $i$ is equally likely
to fall in any of the $k + 1$ positions relative to the $k$ ranked items. Given the uniform distribution over
compatible rankings item $j$ is equally likely to appear in positions
$\tau_U(j), \ldots, \tau_U(j) + \phi_U(j) - 1$. Thus

\[
p_{ij} = \frac{1}{\phi_U(j)} \frac{\tau_U(j)}{k + 1} + \cdots
       + \frac{1}{\phi_U(j)} \frac{\tau_U(j) + \phi_U(j) - 1}{k + 1}
       = \frac{\tau_U(j) + \frac{\phi_U(j) - 1}{2}}{k + 1}.
\tag{14}
\]

Similarly, if $j$ is unknown and $i$ is known then the same argument applies with the roles exchanged, using
$p_{ij} + p_{ji} = 1$. If both $i$ and $j$ are unknown either ordering must be equally likely given the
uniform distribution, making $p_{ij} = 1/2$. Finally, if both $i$ and $j$ are known, $p_{ij} = 1, 1/2, 0$
depending on their preference. Given $p_{ij}$, linearity of expectation, and the independence between the two
rankings, the change in the expected number of inversions relative to the uniform expectation $n(n-1)/4$ can
be found by considering each pair separately,

\[
\begin{aligned}
E\,T(i,j) &= \tfrac{1}{2} P(\text{the rankings disagree on } i, j) - \tfrac{1}{2} P(\text{the rankings agree on } i, j) \\
 &= \tfrac{1}{2}\Big( p_{ij}(S)\big(1 - p_{ij}(R)\big) + \big(1 - p_{ij}(S)\big) p_{ij}(R)
    - p_{ij}(S)\, p_{ij}(R) - \big(1 - p_{ij}(S)\big)\big(1 - p_{ij}(R)\big) \Big) \\
 &= -\tfrac{1}{2} \big(1 - 2p_{ij}(S)\big)\big(1 - 2p_{ij}(R)\big).
\end{aligned}
\]

Summing the $n(n-1)/2$ components yields the desired quantity. □
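
As a sanity check of the pairwise formula, the sketch below evaluates the expected Kendall's tau between two tied, incomplete rankings from the $p_{ij}$ values and compares it with the average over all pairs of consistent permutations. It assumes rankings are given as lists of disjoint item sets, takes $\tau_U(i)$ to be one plus the number of ranked items in strictly earlier sets, and counts $i$ itself in $\phi_U(i)$; these conventions and the function names are assumptions of the example.

```python
from itertools import permutations

def pair_prob(i, j, tiers):
    """p_ij(U): probability that i is preferred to j under the uniform measure on U."""
    k = sum(len(t) for t in tiers)          # number of ranked items
    tau, phi, start = {}, {}, 1
    for tier in tiers:
        for item in tier:
            tau[item], phi[item] = start, len(tier)
        start += len(tier)
    if i in tau and j in tau:
        if tau[i] == tau[j]:
            return 0.5                      # tied items: either order equally likely
        return 1.0 if tau[i] < tau[j] else 0.0
    if i in tau:                            # only i is ranked
        return 1.0 - (tau[i] + (phi[i] - 1) / 2) / (k + 1)
    if j in tau:                            # only j is ranked
        return (tau[j] + (phi[j] - 1) / 2) / (k + 1)
    return 0.5                              # neither is ranked

def expected_tau(S, R, n):
    """Right-hand side of the pairwise formula, O(n^2) pairs."""
    total = n * (n - 1) / 4
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            total -= 0.5 * (1 - 2 * pair_prob(i, j, S)) * (1 - 2 * pair_prob(i, j, R))
    return total

def consistent(tiers, n):
    tier_of = {x: t for t, tier in enumerate(tiers) for x in tier}
    out = []
    for order in permutations(range(1, n + 1)):
        seen = [tier_of[x] for x in order if x in tier_of]
        if seen == sorted(seen):
            out.append(order)
    return out

def kendall_tau(a, b):
    pos = {item: r for r, item in enumerate(b)}
    seq = [pos[x] for x in a]
    return sum(seq[i] > seq[j] for i in range(len(seq)) for j in range(i + 1, len(seq)))

def brute_force_tau(S, R, n):
    """Left-hand side: average distance over all pairs of consistent permutations."""
    S_set, R_set = consistent(S, n), consistent(R, n)
    return sum(kendall_tau(p, s) for p in S_set for s in R_set) / (len(S_set) * len(R_set))

S, R = [{1, 3}, {4}], [{4}, {2}, {1}]
print(expected_tau(S, R, 5), brute_force_tau(S, R, 5))
```

The pairwise version touches only pairs of items, whereas the brute-force average enumerates every consistent permutation, which is the gap the proposition closes.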

Corollary 1. Denoting the number of items ranked by either $S$ or $R$ or both as $k$, and assuming either
$h > \max_{\pi \in R} \max_{i=1}^{m} \max_{\sigma \in S_i} T(\sigma, \pi)$ or that the modified triangular
kernel $K_h^*(\pi, \sigma) \propto 1 - h^{-1} T(\pi, \sigma)$ is used, the complexity of computing
$\hat{p}(R)$ in (6) (assuming uniform $q(\pi \mid S_i)$) is $O(k^2)$ online and $O(n^4)$ offline.

Proof. The proof follows from noting that (6) reduces to $O(n^4)$ offline computation of the normalization
term and $O(k^2)$ online computation of the form (13). □

4 Applications and Case Studies 

We divide our experimental study into three parts. In the first we examine the task of predicting
probabilities. The remaining two parts use these probabilities for rank prediction and rule discovery.

In our experiments we used three datasets. The Movielens dataset¹ contains one million ratings from
6040 users over 3952 movies. The EachMovie dataset² contains 2.6 million ratings from 74424 users over
1648 movies. The Netflix dataset³ contains 100 million movie ratings from 480189 users on 17770 movies. In all
of these datasets users typically rated only a small number of items. Histograms of the distribution of the
number of votes per user, number of votes per item, and vote distribution appear in Figure 3.

4.1 Estimating Probabilities 

We consider here the task of estimating $\hat{p}(R)$ where $R$ is a set of permutations corresponding to a
tied incomplete ranking. Such estimates may be used to compute conditional estimates
$\hat{p}(R \mid S_{m+1})$ which are used to predict which augmentations $R$ of $S_{m+1}$ are highly probable.
For example, given an observed preference $3 \prec 2 \prec 5$ we may want to compute
$\hat{p}(8 \prec 3 \prec 2 \prec 5 \mid 3 \prec 2 \prec 5) = \hat{p}(8 \prec 3 \prec 2 \prec 5) / \hat{p}(3 \prec 2 \prec 5)$
to see whether item 8 should be recommended to the user.



1 http://www.grouplens.org

2 http://www.grouplens.org/node/76

3 http://www.netflixprize.com/community
For simplicity we focus in this section on probabilities of simple events such as $i \prec j$ or
$i \prec j \prec k$. The next section deals with more complex events. In our experiment, we estimate the
probability of $i \prec j$ for the $n = 53$ most rated movies in Netflix and $m = 10000$ users who rate most
of these movies. The probability matrix of the pairs is shown in Figure 4 where each cell corresponds to the
probability of preference between a pair of movies determined by row $j$ and column $i$. In the top left panel
the rows and columns are ordered by the average probability of a movie being preferred to others,
$r(i) = \sum_j \hat{p}(i \prec j) / n$, with the most preferred movie in row and column 1 (the top right
panel indicates the ordering according to $r(i)$). In the bottom left panel the movies were ordered first by
popularity of genres and then by $r(i)$. The bottom right panel indicates that ordering. The names, genres,
and both orderings of all 53 movies appear in Figure 6.

The three highest movies in terms of $r(i)$ are Lord of the Rings: The Return of the King, Finding
Nemo, and Lord of the Rings: The Two Towers. The three lowest movies are Maid in Manhattan, Anger
Management, and The Royal Tenenbaums. Examining the genres (colors in right panels of Figure 4) we see
that family and science fiction are generally preferred to other movies while comedy and romance generally
receive lower preferences. The drama and action genres are somewhere in the middle.

Also interesting is the variance of the movie preferences within specific genres. Family movies are generally 
preferred to almost all other movies. Science fiction movies, on the other hand, enjoy high preference overall 
but exhibit a larger amount of variability as a few movies are among the least preferred. Similarly, the 
preference probabilities of action movies are widely spread with some movies being preferred to others and 
others being less preferred. More specifically (see bottom left panel of Figure 4) we see that the decay of
$r(i)$ within genres is linear for family and romance and nonlinear for science fiction, action, drama, and
comedy. In these last three genres there are a few really "bad" movies that are substantially lower than the
rest of the curve. Figure 6 shows the full information including titles, genres and orderings of the 53 most
popular movies in Netflix.

We plot the individual values of p(i -< j) for three movies: Shrek (family), Catch Me If You Can (drama) 
and Napoleon Dynamite (comedy) (Figure 5). Comparing the three stem plots we observe that Shrek is
preferred to almost all other movies, Napoleon Dynamite is less preferred than most other movies, and Catch 
Me If You Can is preferred to some other movies but less preferred than others. Also interesting is the linear 
increase of the stem plots for Catch Me If You Can and Napoleon Dynamite and the non-linear increase of 
the stem plot for Shrek. This is likely a result of the fact that for very popular movies there are only a few 
comparable movies with the rest being very likely to be less preferred movies (p(i -< j) close to 1). 

In a second experiment (see Figure 7) we compare the predictive behavior of the kernel smoothing estimator
with that of a parametric model (Mallows model) and the empirical measure (frequency of the event occurring
in the $m$ samples). We evaluate the predictive performance of a probability estimator by separating the data
into two parts: a training set that is used to construct the estimator and a testing set used for evaluation
via its loglikelihood. A higher test set loglikelihood indicates that the model assigns high probability to
events that occurred. Mathematically, this corresponds to approximating the KL divergence between nature and
the model. Since the Mallows model is intractable for large $n$ we chose in this experiment small values of
$n$: 3, 4, 5.

We observe that the kernel estimator consistently achieves higher test set loglikelihood than the Mallows
model and the empirical measure. The former is due to the breakdown of parametric assumptions as indicated
by Figure 1 (note that this happens even for $n$ as low as 3). The latter is due to the superior statistical
performance of the kernel estimator over the empirical measure.

4.2 Rank Prediction 

Our task here is to predict the ranking of new unseen items for users. We follow the standard procedure in
collaborative filtering: the set of users is partitioned into two sets, a training set and a testing set. For
each of the test set users we further split the observed items into two sets: one set used for estimating
preferences (together with the preferences of the training set users) and the second set to evaluate the
performance of the prediction [14]. Given a loss function $L(i,j)$ which measures the loss of predicting rank
$i$ when the true rank is $j$ (rank here refers to the number of sets of equivalent items that are more or
less preferred
Figure 4: Left: The estimated probability of movie i being preferred to movie j. Right: a plot of
$r(i) = \sum_j \hat{p}(i \prec j) / n$ for all movies with color indicating genres. In both panels the movies
were ordered by r(i) (top row) and first by popularity of genres and then by r(i) (bottom row).
Figure 5: The value $\hat{p}(i \prec j)$ for all j for three movies: Shrek (left), Catch Me If You Can
(middle) and Napoleon Dynamite (right). Each panel shows the probability of the named movie being preferred
to the others.
Title                               Genre  Order1  Order2
Finding Nemo                          6       2       1
Shrek                                 6       4       2
The Incredibles                       6       5       3
Monsters, Inc.                        6       8       4
Shrek II                              6       9       5
LOTR: The Return of the King          1       1       6
LOTR: The Two Towers                  1       3       7
LOTR: The Fellowship of the Ring      1       6       8
Spider-Man II                         1      12       9
Spider-Man                            1      16      10
The Day After Tomorrow                1      36      11
Tomb Raider                           1      46      12
Men in Black II                       1      47      13
Pirates of the Caribbean I            3       7      14
The Last Samurai                      3      10      15
Man on Fire                           3      11      16
The Bourne Identity                   3      13      17
The Bourne Supremacy                  3      15      18
National Treasure                     3      17      19
The Italian Job                       3      19      20
Kill Bill II                          3      23      21
Kill Bill I                           3      25      22
Minority Report                       3      31      23
S.W.A.T.                              3      44      24
The Fast and the Furious              3      45      25
Ocean's Eleven                        2      14      26
I, Robot                              2      20      27
Mystic River                          2      21      28
Troy                                  2      22      29
Catch Me If You Can                   2      24      30
Big Fish                              2      28      31
Collateral                            2      29      32
John Q                                2      34      33
Pearl Harbor                          2      35      34
Swordfish                             2      39      35
Lost in Translation                   2      48      36
50 First Dates                        4      18      37
My Big Fat Greek Wedding              4      26      38
Something's Gotta Give                4      27      39
The Terminal                          4      30      40
How to Lose a Guy in 10 Days          4      32      41
Sweet Home Alabama                    4      38      42
Sideways                              4      41      43
Two Weeks Notice                      4      43      44
Mr. Deeds                             4      49      45
The Wedding Planner                   4      50      46
Maid in Manhattan                     4      53      47
The School of Rock                    5      33      48
Bruce Almighty                        5      37      49
Dodgeball: A True Underdog Story      5      40      50
Napoleon Dynamite                     5      42      51
The Royal Tenenbaums                  5      51      52
Anger Management                      5      52      53





Figure 6: The table contains the information of the 53 most popular movies of Netflix. Columns are movie
titles, genres, order1 (the ordering in the upper row of Figure 4) and order2 (the ordering in the bottom row
of Figure 4). Genres indicated by numbers from 1 to 6 represent science fiction, drama, action, romance,
comedy, and family.
Figure 7: The test-set log-likelihood for kernel smoothing, Mallows model, and the empirical measure with
respect to training size m for a small number of items n = 3, 4, 5 (top, middle, bottom rows) on three
datasets. Both the Mallows model (which is also intractable for large n, which is why n ≤ 5 in the
experiment) and the empirical measure perform worse than the kernel estimator $\hat{p}$.
than the current item) we evaluate a prediction rule by the expected loss. We focus on three loss functions:
$L_0(i,j) = 0$ if $i = j$ and $1$ otherwise, $L_1(i,j) = |i - j|$ which reduces to the standard CF evaluation
technique described in [14], and an asymmetric loss function (rows correspond to estimated number of stars
(0-5) and columns to actual number of stars (0-5))

\[
L_e = \begin{pmatrix}
0 & 0 & 0 & 3 & 4 & 5 \\
0 & 0 & 0 & 2 & 3 & 4 \\
0 & 0 & 0 & 1 & 2 & 3 \\
9 & 4 & 1.5 & 0 & 0 & 0 \\
12 & 6 & 3 & 0 & 0 & 0 \\
15 & 8 & 4.5 & 0 & 0 & 0
\end{pmatrix}
\tag{15}
\]



In contrast to the $L_0$ and $L_1$ loss, $L_e$ captures the fact that recommending bad movies as good movies
is worse than recommending good movies as bad.

For example, consider a test user whose observed preference is
$3 \prec 4,5,6 \prec 10,11,12 \prec 23 \prec 40,50,60 \prec 100,101$. We may withhold the preferences of items
4, 11 for evaluation purposes. The recommendation system then predicts a rank of 1 for item 4 and a rank of 4
for item 11. Since the true ranks of these items are 2 and 3, the absolute value loss is $|1 - 2| = 1$ and
$|4 - 3| = 1$ respectively.

In our experiment, we use the kernel estimator $\hat{p}$ to predict ranks that minimize the posterior loss and
thus adapt to customized loss functions such as $L_e$. This is an advantage of a probabilistic modeling
approach over more ad-hoc rule based recommendation systems.
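
The decision step can be sketched as follows: given estimated posterior probabilities over the possible true ranks of a withheld item, choose the predicted rank with the smallest expected loss. The function name, the three-level posterior, and the small asymmetric matrix below are hypothetical values chosen for illustration; they are not taken from the experiments.

```python
def bayes_prediction(posterior, loss):
    """Return (best prediction index, expected losses) for loss[predicted][true]."""
    expected = [sum(l * p for l, p in zip(row, posterior)) for row in loss]
    best = min(range(len(loss)), key=expected.__getitem__)
    return best, expected

levels = 3
posterior = [0.4, 0.2, 0.4]                      # hypothetical P(true rank = 0, 1, 2)

# Absolute value loss |i - j|.
abs_loss = [[abs(i - j) for j in range(levels)] for i in range(levels)]

# Hypothetical asymmetric loss: predicting too high costs three times as much
# as predicting too low (rows = predicted, columns = true).
asym_loss = [[0, 1, 2],
             [3, 0, 1],
             [6, 3, 0]]

print(bayes_prediction(posterior, abs_loss))
print(bayes_prediction(posterior, asym_loss))
```

With the same posterior, the symmetric loss selects the middle level while the asymmetric loss moves the prediction to the lowest level, which is the kind of adaptation to the loss function described above.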

Figure 8 compares the performance of our estimator to several standard baselines in the collaborative
filtering literature: two older memory based methods, vector similarity (sim1) and correlation (sim2), e.g.,
[?], and a recent state-of-the-art non-negative matrix factorization (NMF) method (gnmf) [?]. The kernel
smoothing estimate performed similarly to the state-of-the-art but substantially better than the memory based
methods to which it is functionally similar.



4.3 Rule Discovery 

In the third task, we used the estimator $p$ to detect noteworthy association rules of the type 
$i \prec j \Rightarrow k \prec l$ (if $i$ is preferred to $j$ then it is probably the case that $k$ is preferred 
to $l$). Such association rules are important for both business analytics (devising marketing and manufacturing 
strategies) and recommendation system engineering. Specifically, we used $p$ to select sets of four items 
$i, j, k, l$ for which the mutual information $I(i \prec j;\, k \prec l)$ is maximized. After these sets were 
identified, we determined the precise shape of the rule (i.e., $i \prec j \Rightarrow k \prec l$ rather than 
$j \prec i \Rightarrow k \prec l$) by examining the summands in the mutual information expectation. 
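
The role of the summands can be illustrated with a small sketch. We assume the estimator supplies the joint probabilities of the two binary events $A = (i \prec j)$ and $B = (k \prec l)$; the numbers below are hypothetical, and reading the rule shape from the largest positive summand is one simple heuristic, not necessarily the exact procedure used in the experiments.

```python
import math


def mi_summands(joint):
    """Summands of the mutual information; joint[a][b] = p(A=a, B=b) for binary a, b."""
    pa = [sum(joint[a]) for a in (0, 1)]
    pb = [joint[0][b] + joint[1][b] for b in (0, 1)]
    terms = {}
    for a in (0, 1):
        for b in (0, 1):
            p = joint[a][b]
            terms[(a, b)] = p * math.log(p / (pa[a] * pb[b])) if p > 0 else 0.0
    return terms


def mutual_information(joint):
    """Mutual information (in nats) between the two binary events."""
    return sum(mi_summands(joint).values())


# Hypothetical joint distribution: index 1 means the event holds,
# i.e. joint[1][1] = p(i < j and k < l in preference order).
joint = [[0.45, 0.05],
         [0.20, 0.30]]

terms = mi_summands(joint)
dominant = max(terms, key=terms.get)  # cell that contributes most to the dependence
print(round(mutual_information(joint), 4), dominant)
```

Here the dominant cell is (1, 1), which corresponds to the shape $i \prec j \Rightarrow k \prec l$; a dominant cell of (0, 1) would instead point to a rule whose antecedent is the complement of $i \prec j$.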

Figure 9 (top) shows the top 10 rules that were discovered. These rules nicely isolate viewer preferences 
for genres such as fantasy, romantic comedies, animation, and action (note, however, that genre information 
was not used in the rule discovery). To quantitatively evaluate the rule discovery process we judge a rule 
$i \prec j \Rightarrow k \prec l$ to be good if $i, k$ are of the same genre and $j, l$ are of the same genre. 
This quantitative evaluation appears in Figure 9 (bottom), where it is contrasted with the same rule discovery 
process (maximizing mutual information) based on the empirical measure. 
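
The evaluation criterion is simple enough to state as code. The sketch below uses toy item identifiers and made-up genre labels (no real movie metadata is implied) and computes the frequency of good rules among the first $n$ discovered rules, which is the quantity plotted on the y axis of the evaluation.

```python
def is_good(rule, genre):
    """A rule (i, j, k, l), read as i < j => k < l, is good if i, k share a genre and j, l share a genre."""
    i, j, k, l = rule
    return genre[i] == genre[k] and genre[j] == genre[l]


def good_rule_frequency(rules, genre):
    """Frequency of good rules among the first n rules, for n = 1..len(rules)."""
    freqs, good = [], 0
    for n, rule in enumerate(rules, start=1):
        good += is_good(rule, genre)
        freqs.append(good / n)
    return freqs


# Hypothetical genre labels and discovered rules, ordered by decreasing score.
genre = {"s1": "animation", "s2": "animation",
         "f1": "fantasy", "f2": "fantasy", "f3": "fantasy",
         "k1": "action", "k2": "action", "n1": "adventure", "r1": "scifi"}
rules = [("s1", "f1", "s2", "f2"),
         ("k1", "n1", "k2", "r1"),
         ("s2", "f1", "s1", "f3")]

freqs = good_rule_frequency(rules, genre)
print(freqs)
```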

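In another rule discovery experiment, we used $p$ to detect association rules of the form $i$ ranked highest 
$\Rightarrow$ $j$ ranked second highest by selecting $i, j$ that maximize the score 

$$\frac{p(i \text{ ranked highest},\ j \text{ ranked second highest})}{p(i \text{ ranked highest})\, p(j \text{ ranked second highest})}$$

between pairs of movies in the Netflix data. We similarly detected rules of the form $i$ ranked highest 
$\Rightarrow$ $j$ ranked lowest by maximizing the score 

$$\frac{p(i \text{ ranked highest},\ j \text{ ranked lowest})}{p(i \text{ ranked highest})\, p(j \text{ ranked lowest})}$$

between pairs of movies. 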
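
A hedged sketch of this scoring step is given below. It assumes that the score is the ratio of the joint probability of the two rank events to the product of their marginals (a lift-style score, which is our reading of the formula), and that the required probabilities have already been obtained from the estimator. All names and numbers are illustrative.

```python
def lift_scores(p_joint, p_first, p_second):
    """Score of each rule 'i ranked highest => j ranked second highest'.

    p_joint[(i, j)]: probability that i is ranked highest and j second highest.
    p_first[i]:      probability that i is ranked highest.
    p_second[j]:     probability that j is ranked second highest.
    """
    return {
        (i, j): pj / (p_first[i] * p_second[j])
        for (i, j), pj in p_joint.items()
        if p_first[i] > 0 and p_second[j] > 0
    }


def top_rules(scores, n):
    """The n item pairs with the largest scores."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Hypothetical probabilities for three items A, B, C.
p_first = {"A": 0.2, "B": 0.1}
p_second = {"A": 0.1, "B": 0.2, "C": 0.1}
p_joint = {("A", "B"): 0.08, ("A", "C"): 0.01, ("B", "A"): 0.015}

scores = lift_scores(p_joint, p_first, p_second)
print(top_rules(scores, 2))
```

A score above 1 indicates that the two rank events co-occur more often than they would under independence; the same function can be reused for the "ranked lowest" variant by passing the corresponding probabilities.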

The left panel of Figure 10 shows the top 9 rules among the 100 most rated movies, which nicely capture 
preferences for movies of a similar type, e.g., romance, comedy, and action. The right panel of Figure 10 shows 
the top 9 rules that capture likes and dislikes across different movie types, e.g., a liking for romance leads 
to a dislike of action/thriller. 

In a third experiment, we used $p$ to construct an undirected graph where vertices are items (Netflix 
movies) and two nodes $i, j$ are connected by an edge if the average score of the rule $i$ ranked highest $\Rightarrow$ 
Figure 8: The prediction loss (top row: 0/1 loss $L_0$, middle row: $L_1$ loss, bottom row: asymmetric loss 
$L_e$) with respect to training size on three datasets. The kernel smoothing estimate performed similarly to 
the state-of-the-art gnmf (matrix factorization) but substantially better than the memory-based methods to 
which it is functionally similar. 
Shrek $\prec$ LOTR: The Fellowship of the Ring $\Rightarrow$ Shrek 2 $\prec$ LOTR: The Return of the King 
Shrek $\prec$ LOTR: The Fellowship of the Ring $\Rightarrow$ Shrek 2 $\prec$ LOTR: The Two Towers 
Shrek 2 $\prec$ LOTR: The Fellowship of the Ring $\Rightarrow$ Shrek $\prec$ LOTR: The Return of the King 
Kill Bill 2 $\prec$ National Treasure $\Rightarrow$ Kill Bill 1 $\prec$ I, Robot 
Shrek 2 $\prec$ LOTR: The Fellowship of the Ring $\Rightarrow$ Shrek 2 $\prec$ LOTR: The Two Towers 
LOTR: The Fellowship of the Ring $\prec$ Monsters, Inc. $\Rightarrow$ LOTR: The Two Towers $\prec$ Shrek 
National Treasure $\prec$ Kill Bill 2 $\Rightarrow$ Pearl Harbor $\prec$ Kill Bill 1 
LOTR: The Fellowship of the Ring $\prec$ Monsters, Inc. $\Rightarrow$ LOTR: The Return of the King $\prec$ Shrek 
How to Lose a Guy in 10 Days $\prec$ Kill Bill 2 $\Rightarrow$ 50 First Dates $\prec$ Kill Bill 1 
I, Robot $\prec$ Kill Bill 2 $\Rightarrow$ The Day After Tomorrow $\prec$ Kill Bill 1 




Figure 9: Top: top 10 rules discovered by the kernel smoothing estimator on Netflix in terms of maximizing 
mutual information. Bottom: a quantitative evaluation of the rule discovery. The x axis represents the 
number of rules discovered and the y axis represents the frequency of good rules in the discovered rules. 
Here a rule $i \prec j \Rightarrow k \prec l$ is considered good if $i, k$ are of the same genre and $j, l$ are of the same genre. 



$j$ ranked second highest and the rule $j$ ranked highest $\Rightarrow$ $i$ ranked second highest is higher than a 
certain threshold. Figure 11 shows the graph for the 100 most rated movies in Netflix (only movies with vertex 
degree greater than 0 are shown). The clusters in the graph, indicated by vertex color and numbering, 
were obtained using a graph partitioning algorithm, and the graph is embedded in a 2-D plane using a standard 
graph visualization technique. Within each of the identified clusters the movies are clearly similar with respect 
to genre, while an even finer separation can be observed when looking at specific clusters. For example, 
clusters 6 and 9 both contain comedy movies, but cluster 6 tends toward slapstick humor whereas cluster 
9 contains romantic comedies. 
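
The edge rule of this graph can be sketched as follows. The function name, the toy scores, and the threshold value are ours; the sketch covers only the edge construction and the removal of isolated vertices, not the graph partitioning or the 2-D embedding.

```python
def affinity_graph(score, threshold):
    """Undirected graph with an edge {i, j} when the average of the two directed
    rule scores, score[(i, j)] and score[(j, i)], exceeds `threshold`.

    score[(i, j)] is the score of the rule 'i ranked highest => j ranked second highest'.
    Returns the vertices of positive degree and the edge set.
    """
    edges = set()
    for (i, j), s_ij in score.items():
        s_ji = score.get((j, i))
        if i < j and s_ji is not None and (s_ij + s_ji) / 2 > threshold:
            edges.add((i, j))
    vertices = {v for edge in edges for v in edge}  # isolated vertices are dropped
    return vertices, edges


# Hypothetical directed rule scores between three items.
score = {("A", "B"): 2.0, ("B", "A"): 1.5,
         ("A", "C"): 0.5, ("C", "A"): 0.7,
         ("B", "C"): 3.0}

vertices, edges = affinity_graph(score, threshold=1.0)
print(sorted(vertices), sorted(edges))
```

In this toy example only A and B are connected: the pair (A, C) falls below the threshold, and the pair (B, C) is skipped because the score of the reverse rule is not available.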



5 Related Work 

Collaborative filtering, or recommendation systems, has been an active research area in computer science since 
the 1990s. The earliest efforts predicted the rating of items based on the similarity between the test 
user and the training users [13, ?, ?]. Specifically, these attempts used similarity measures such as Pearson 
correlation [?] and vector cosine similarity [?, ?] to evaluate the similarity level between different users. 
More recent work includes user and movie clustering [?, 21, 22], item-item similarities [?], Bayesian 
networks [?], dependence networks [?], and probabilistic latent variable models [?]. 

Most recently, the state-of-the-art methods, including the winner of the Netflix competition, are based on 
non-negative matrix factorization of the partially observed user-rating matrix. The factorized matrix can be 
used to fill in the unobserved entries in a way similar to latent factor analysis [?]. 

Each of the above methods focuses exclusively on user ratings. In some cases item information is available 
(movie genre, actors, directors, etc.), which has led to several approaches that combine voting information 
with item information (e.g., [?]). 

Our method differs from the methods above in that it constructs a full probabilistic model on preferences, 
Figure 10: Top rules discovered by the kernel smoothing estimate on Netflix. Left: like A $\Rightarrow$ like B. Right: 
like A $\Rightarrow$ dislike B. 



it is able to handle heterogeneous preference information (not all users must specify the same number of 
preference classes) and does not make any parametric assumptions. In contrast to previous approaches it 
enables not only the prediction of item ratings, but also the discovery of association rules and the estimation 
of probabilities of interesting events. 



6 Summary 

Estimating distributions from tied and incomplete data is a central task in many applications with perhaps 
the most obvious one being collaborative filtering. An accurate estimator p enables going beyond the 
traditional item-rank prediction task. It can be used to compute probabilities of interest, find association 
rules, and perform a wide range of additional data analysis tasks. 

We demonstrate the first non-parametric estimator for such data that is computationally tractable, i.e., 
polynomial rather than exponential in $n$. The computation is made possible by generating function and 
dynamic programming techniques. 

We examine the behavior of the estimator $p$ in three sets of experiments. The first set of experiments 
involves estimating probabilities of interest such as $p(i \prec j)$. The second set of experiments involves 
predicting preferences of held-out items, which is directly applicable in recommendation systems. In this task, 
our estimator outperforms other memory-based methods (to which it is functionally similar) and performs 
similarly to state-of-the-art methods that are based on non-negative matrix factorization. In the third set of 
experiments we examined the use of the estimator in discovering association rules such as 
$i \prec j \Rightarrow k \prec l$. 
Figure 11: A graph corresponding to the 100 most rated Netflix movies where edges represent high affinity 
as determined by the rule discovery process (see text for more details). 